Case Study 2

## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel

Data import

Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':    870 obs. of  36 variables:
 $ ID                      : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Age                     : num  32 40 35 32 24 27 41 37 34 34 ...
 $ Attrition               : chr  "No" "No" "No" "No" ...
 $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" ...
 $ DailyRate               : num  117 1308 200 801 567 ...
 $ Department              : chr  "Sales" "Research & Development" "Research & Development" "Sales" ...
 $ DistanceFromHome        : num  13 14 18 1 2 10 5 10 10 10 ...
 $ Education               : num  4 3 2 4 1 2 5 4 4 4 ...
 $ EducationField          : chr  "Life Sciences" "Medical" "Life Sciences" "Marketing" ...
 $ EmployeeCount           : num  1 1 1 1 1 1 1 1 1 1 ...
 $ EmployeeNumber          : num  859 1128 1412 2016 1646 ...
 $ EnvironmentSatisfaction : num  2 3 3 3 1 4 2 4 3 4 ...
 $ Gender                  : chr  "Male" "Male" "Male" "Female" ...
 $ HourlyRate              : num  73 44 60 48 32 32 90 88 87 92 ...
 $ JobInvolvement          : num  3 2 3 3 3 3 4 2 3 2 ...
 $ JobLevel                : num  2 5 3 3 1 3 1 2 1 2 ...
 $ JobRole                 : chr  "Sales Executive" "Research Director" "Manufacturing Director" "Sales Executive" ...
 $ JobSatisfaction         : num  4 3 4 4 4 1 3 4 3 3 ...
 $ MaritalStatus           : chr  "Divorced" "Single" "Single" "Married" ...
 $ MonthlyIncome           : num  4403 19626 9362 10422 3760 ...
 $ MonthlyRate             : num  9250 17544 19944 24032 17218 ...
 $ NumCompaniesWorked      : num  2 1 2 1 1 1 2 2 1 1 ...
 $ Over18                  : chr  "Y" "Y" "Y" "Y" ...
 $ OverTime                : chr  "No" "No" "No" "No" ...
 $ PercentSalaryHike       : num  11 14 11 19 13 21 12 14 19 14 ...
 $ PerformanceRating       : num  3 3 3 3 3 4 3 3 3 3 ...
 $ RelationshipSatisfaction: num  3 1 3 3 3 3 1 3 4 2 ...
 $ StandardHours           : num  80 80 80 80 80 80 80 80 80 80 ...
 $ StockOptionLevel        : num  1 0 0 2 0 2 0 3 1 1 ...
 $ TotalWorkingYears       : num  8 21 10 14 6 9 7 8 1 8 ...
 $ TrainingTimesLastYear   : num  3 2 2 3 2 4 5 5 2 3 ...
 $ WorkLifeBalance         : num  2 4 3 3 3 2 2 3 3 2 ...
 $ YearsAtCompany          : num  5 20 2 14 6 9 4 1 1 8 ...
 $ YearsInCurrentRole      : num  2 7 2 10 3 7 2 0 1 2 ...
 $ YearsSinceLastPromotion : num  0 4 2 5 1 1 0 0 0 7 ...
 $ YearsWithCurrManager    : num  3 9 2 7 3 7 3 0 0 7 ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_double(),
  ..   Age = col_double(),
  ..   Attrition = col_character(),
  ..   BusinessTravel = col_character(),
  ..   DailyRate = col_double(),
  ..   Department = col_character(),
  ..   DistanceFromHome = col_double(),
  ..   Education = col_double(),
  ..   EducationField = col_character(),
  ..   EmployeeCount = col_double(),
  ..   EmployeeNumber = col_double(),
  ..   EnvironmentSatisfaction = col_double(),
  ..   Gender = col_character(),
  ..   HourlyRate = col_double(),
  ..   JobInvolvement = col_double(),
  ..   JobLevel = col_double(),
  ..   JobRole = col_character(),
  ..   JobSatisfaction = col_double(),
  ..   MaritalStatus = col_character(),
  ..   MonthlyIncome = col_double(),
  ..   MonthlyRate = col_double(),
  ..   NumCompaniesWorked = col_double(),
  ..   Over18 = col_character(),
  ..   OverTime = col_character(),
  ..   PercentSalaryHike = col_double(),
  ..   PerformanceRating = col_double(),
  ..   RelationshipSatisfaction = col_double(),
  ..   StandardHours = col_double(),
  ..   StockOptionLevel = col_double(),
  ..   TotalWorkingYears = col_double(),
  ..   TrainingTimesLastYear = col_double(),
  ..   WorkLifeBalance = col_double(),
  ..   YearsAtCompany = col_double(),
  ..   YearsInCurrentRole = col_double(),
  ..   YearsSinceLastPromotion = col_double(),
  ..   YearsWithCurrManager = col_double()
  .. )

The data set has 870 observations and 36 variables for us to work with.

Pre-Processing

Here I am doing some processing of the data. I am going to convert some character columns to factors to make modeling easier.

Exploratory Analysis

Lets start by looking at our dependent variables

Attrition

Attrition Count Proportion
No 730 0.8390805
Yes 140 0.1609195

Out of the 870 employees, 140 left their jobs, which is 16% attrition

Monthly Income

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1081    2840    4946    6390    8182   19999 

We have Monthly Incomes ranging from $1,081 to $19,999. Income is very right skewed, which will effect modeling. We probably want to transform it.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  6.986   7.951   8.506   8.538   9.010   9.903 

The log version of Monthly Income seems more normal but wasn’t as good as I had hoped from the transformation.

Business Travel

BusinessTravel Count Proportion
Non-Travel 94 0.1080460
Travel_Frequently 158 0.1816092
Travel_Rarely 618 0.7103448

681 employees or 71% travel rarely. It seems like the most frequent travelers have the highest attrition rates, and non-travelers have the lowest. Non-travelers have a lower 75th percentile on income, Frequent and non-frequent travelers have similar pay.

Department

Department Count Proportion
Human Resources 35 0.0402299
Research & Development 562 0.6459770
Sales 273 0.3137931

The sales department has the highest attrition rate and R&D has the lowest attrition rate. Mean income is fairly similar but their is a clear difference in medians with HR having the lowest median pay and Sales having the highest median pay.

Education

Education Count Proportion
1 98 0.1126437
2 182 0.2091954
3 324 0.3724138
4 240 0.2758621
5 26 0.0298851

Assuming higher values of Education mean more advance education, better educated employees seem to have lower attrtion rates. 1-3 are similar but 4 & 5 have a clear decrease. Education 5 also has a very clear pay advantage.

EducationField Count Proportion
Human Resources 15 0.0172414
Life Sciences 358 0.4114943
Marketing 100 0.1149425
Medical 270 0.3103448
Other 52 0.0597701
Technical Degree 75 0.0862069

Those with education in HR have the highest attrition rates in this sample. Education in Human Resources seems to come with lower median pay, but those educated in marketing have higher median pay

Environment Satisfaction

EnvironmentSatisfaction Count Proportion
1 172 0.1977011
2 178 0.2045977
3 258 0.2965517
4 262 0.3011494

Assuming lower is less satisfied, those least satisfied with their enviroment have the highest attrition rates. Mean and Median pay is similar between groups here, but the 75th percentile is lower for groups 2&4. Not sure how this is meaningful to pay overall.

Gender

Gender Count Proportion
Female 354 0.4068966
Male 516 0.5931034

Attrition rates are similar between genders, with Men having just a slightly higher rate. In terms of pay, Men are making a little less than average in this sample. Lets look at some other cuts to see if there is any deeper story.

Men have a little better median pay in the Sales department.

In terms of mean and median pay, men rarely have an advantage regardless of education field.

JobRole

JobRole Count Proportion
Healthcare Representative 76 0.0873563
Human Resources 27 0.0310345
Laboratory Technician 153 0.1758621
Manager 51 0.0586207
Manufacturing Director 87 0.1000000
Research Director 51 0.0586207
Research Scientist 172 0.1977011
Sales Executive 200 0.2298851
Sales Representative 53 0.0609195

Sales Representatives have the highest attrition rate by far. Managers and Research Directors have the highest pay and low attrition rates.

MaritalStatus

MaritalStatus Count Proportion
Divorced 191 0.2195402
Married 410 0.4712644
Single 269 0.3091954

Single employees have the highest attrition and below average pay.

Over18

Over18 Count Proportion
Y 870 1

This seems to be an indicator for the employee being over 18, and 100% of our sample falls into this category. Since there is no variance in this variable it would not be useful for modeling so I am omitting any further analysis.

Over Time

OverTime Count Proportion
No 618 0.7103448
Yes 252 0.2896552

Overtime eligible employees have much higher attrition rates and slightly lower monthly income.

Stock Option Level

StockOptionLevel Count Proportion
0 379 0.4356322
1 355 0.4080460
2 81 0.0931034
3 55 0.0632184

Most employees fall in Stock Option level 0 or 1, with much fewer in 2 & 3. Stock option levels 0 & 3 have the highest attrition, with 1 & 2 being much lower. Perhaps 1 & 2 have more long term incentives to keep employees at the firm. 1 & 2 also have higher median / average pay.

Job Level

JobLevel Count Proportion
1 329 0.3781609
2 312 0.3586207
3 132 0.1517241
4 60 0.0689655
5 37 0.0425287

Job level 1 has a much higher attrition rate. Job level also seems very linearly associated with pay.

Job Involvement

JobInvolvement Count Proportion
1 47 0.0540230
2 228 0.2620690
3 514 0.5908046
4 81 0.0931034

Those with lower job involvement have much higher attrition. The relationship with pay is much less clear.

Daily Rate and Hourly Rate

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  103.0   472.5   817.5   815.2  1165.8  1499.0 

I am not seeing any meaningful relationship to attrition or Monthly Income for Daily Rate

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  30.00   48.00   66.00   65.61   83.00  100.00 

I am not seeing any meaningful relationship to attrition or Monthly Income for Hourly Rate, and Hourly Rate doesn’t seem related to Daily Rate. I would normally assoicate Hourly/Daily rates with income but in this case either this is not true or somehow doesn’t translate to monthly income. I will probably ignore these.

Distance From Home

DistHomeFactor Count Proportion
Close 600 0.6896552
Far 127 0.1459770
Medium 143 0.1643678

Those who attrite work farther from home, on average.

Num Companies Worked

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   2.000   2.728   4.000   9.000 

It seems like those that attrite have worked at fewer companies?

ID, Employee Number, Employee Count

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       1       1       1       1 
[1] 870
[1] 870

These are quick checks to make sure each row represents one employee, and no employees have more than one observation. Employee Count never takes a value other than one, so each row is one employee. ID and Employee number are unique, there are no repeats, so no employee has more than one observation.

Naive Bayes

Lets quick try a model with everything in it.

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  86   6
       Yes 59  21
                                          
               Accuracy : 0.6221          
                 95% CI : (0.5451, 0.6948)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.2062          
                                          
 Mcnemar's Test P-Value : 1.12e-10        
                                          
            Sensitivity : 0.7778          
            Specificity : 0.5931          
         Pos Pred Value : 0.2625          
         Neg Pred Value : 0.9348          
             Prevalence : 0.1570          
         Detection Rate : 0.1221          
   Detection Prevalence : 0.4651          
      Balanced Accuracy : 0.6854          
                                          
       'Positive' Class : Yes             
                                          
[1] 0.7257384

Not too bad. Specificity is just under the desired threshold.

Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  126  14
       Yes  19  13
                                          
               Accuracy : 0.8081          
                 95% CI : (0.7412, 0.8641)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 0.9107          
                                          
                  Kappa : 0.3259          
                                          
 Mcnemar's Test P-Value : 0.4862          
                                          
            Sensitivity : 0.48148         
            Specificity : 0.86897         
         Pos Pred Value : 0.40625         
         Neg Pred Value : 0.90000         
             Prevalence : 0.15698         
         Detection Rate : 0.07558         
   Detection Prevalence : 0.18605         
      Balanced Accuracy : 0.67522         
                                          
       'Positive' Class : Yes             
                                          
[1] 0.8842105
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  101  10
       Yes  44  17
                                         
               Accuracy : 0.686          
                 95% CI : (0.611, 0.7545)
    No Information Rate : 0.843          
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.2157         
                                         
 Mcnemar's Test P-Value : 7.098e-06      
                                         
            Sensitivity : 0.62963        
            Specificity : 0.69655        
         Pos Pred Value : 0.27869        
         Neg Pred Value : 0.90991        
             Prevalence : 0.15698        
         Detection Rate : 0.09884        
   Detection Prevalence : 0.35465        
      Balanced Accuracy : 0.66309        
                                         
       'Positive' Class : Yes            
                                         
[1] 0.7890625
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  107  11
       Yes  38  16
                                          
               Accuracy : 0.7151          
                 95% CI : (0.6414, 0.7812)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 0.9999930       
                                          
                  Kappa : 0.2349          
                                          
 Mcnemar's Test P-Value : 0.0002038       
                                          
            Sensitivity : 0.59259         
            Specificity : 0.73793         
         Pos Pred Value : 0.29630         
         Neg Pred Value : 0.90678         
             Prevalence : 0.15698         
         Detection Rate : 0.09302         
   Detection Prevalence : 0.31395         
      Balanced Accuracy : 0.66526         
                                          
       'Positive' Class : Yes             
                                          
[1] 0.8136882
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  106  12
       Yes  39  15
                                          
               Accuracy : 0.7035          
                 95% CI : (0.6292, 0.7706)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 0.9999987       
                                          
                  Kappa : 0.2037          
                                          
 Mcnemar's Test P-Value : 0.0002719       
                                          
            Sensitivity : 0.55556         
            Specificity : 0.73103         
         Pos Pred Value : 0.27778         
         Neg Pred Value : 0.89831         
             Prevalence : 0.15698         
         Detection Rate : 0.08721         
   Detection Prevalence : 0.31395         
      Balanced Accuracy : 0.64330         
                                          
       'Positive' Class : Yes             
                                          
[1] 0.8060837

William Arnost

2019-08-16